Method and apparatus for measuring similarity between documents

ABSTRACT

A measure of similarity between a first sequence of symbols and a second sequence of symbols is computed. Memory is allocated for a computational unit for storing values that are computed using a recursive formulation that computes the measure of similarity based on matching subsequences of symbols between the first sequence of symbols and the second sequence of symbols. A processor computes for the computational unit the values for the measure of similarity using the recursive formulation within which functions are computed using nested loops. The measure of similarity is output by the computational unit to an information processing application.

BACKGROUND OF INVENTION

[0001] The present invention relates generally to document information retrieval, and more particularly to a method and apparatus for computing a measure of similarity between arbitrary sequences of symbols.

[0002] An important aspect of document information retrieval, classification, categorization, clustering, routing, cross-lingual information retrieval, and filtering is the computation of a measure of similarity between two documents, each of which can be reduced to an arbitrary sequence of symbols. Most techniques for computing document similarity require the computation of pairwise similarities over large sets of documents. Experiments have shown that the adopted similarity measure greatly influences performance of information retrieval systems.

[0003] One similarity measure known as the “string kernel” (also referred to herein as the “sequence kernel”) is disclosed by: Chris Watkins, in “Dynamic Alignment Kernels”, Technical Report CSD-TR-98-11, Department of Computer Science, Royal Holloway University of London, 1999; Huma Lodhi, Nello Cristianini, John Shawe-Taylor and Chris Watkins, in “Text Classification Using String Kernels”, Advances in Neural Information Processing Systems 13, the MIT Press, pp. 563-569, 2001; and Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Chris Watkins, in “Text classification using string kernels”, Journal of Machine Learning Research, 2:419-444, 2002, which are all incorporated herein by reference.

[0004] Generally, the string kernel is a similarity measure between two sequences of symbols over the same alphabet, where similarity is assessed as the number of occurrences of (possibly noncontiguous) subsequences shared by two sequences of symbols; the more subsequences in common, the greater the measure of similarity between the two sequences of symbols. The string kernel may be used to evaluate the similarity between different types of sequences of symbols (or “symbolic data”) such as sequences of: characters, words, lemmas, or other predefined sets of terms (e.g., amino acids or DNA bases).

[0005] More specifically, the string kernel is referred to herein as a function which returns the dot product of feature vectors of two input strings. The vector space in which the feature vectors are defined is referred to as the “feature space”. The feature space of the string kernel is the space of all subsequences of length “n” characters in the input strings. The subsequences of characters may be contiguous or noncontiguous in the input strings. However, noncontiguous occurrences are penalized according to the number of gaps they contain.

[0006] A limitation of existing implementations for computing the string kernel is the memory required to carry out the computation. Known implementations for computing the string kernel of two sequences of symbols rely on a dynamic programming technique that requires computing and storing a large number of intermediate results. Such known implementations have used a technique which uses a variable (i.e., a component in a large array) for storing each intermediate result. These intermediate results require memory storage that is proportional in size to the product of the lengths of the sequences being compared.

[0007] Since existing techniques for computing this measure of similarity between arbitrary sequences of symbols require a storage usage proportional to the product of the lengths of the sequences being compared, it would be advantageous therefore to provide a technique for computing a string kernel that reduces the storage usage requirement of existing techniques to enable the computation of the string kernel for longer sequences of symbols.

SUMMARY OF INVENTION

[0008] In accordance with the invention, there is provided a method, apparatus and article of manufacture therefor, for computing a measure of similarity between two sequences of symbols known as the string kernel. In accordance with one aspect of the invention, the geometry of the data dependencies in matrices used to represent the intermediate results of the similarity computation (i.e., what component in the matrices is needed to compute what other component in the matrices) is selected that allows most intermediate results stored in a memory of a computational unit to be deleted from its memory shortly after their computation by the computational unit.

[0009] In accordance with this aspect of the invention, the selected geometry defines an order in which the computational unit computes intermediate values in a “diagonal order”. The computation using the diagonal order requires that only those values which are on the diagonal itself be stored in the memory of the computational unit, thereby permitting the computation of the string kernel using an amount of memory that is proportional to the sum of the lengths of the sequences of symbols for which the similarity measure is computed. Advantageously, the diagonal computation order for computing intermediate results of the string kernel decreases the memory requirements for carrying out the computation of the string kernel between sequences of symbols of any given length.

[0010] In accordance with another aspect of the invention, there is provided a dynamic programming method, apparatus and article of manufacture therefor, for computing a measure of similarity between a first sequence of symbols and a second sequence of symbols. Memory is allocated for a computational unit for storing values that are computed using a recursive formulation that computes the measure of similarity based on matching subsequences between the first sequence of symbols and the second sequence of symbols. A processor computes for the computational unit the values for the measure of similarity using the recursive formulation within which functions are computed using nested loops that include: an outer loop that ranges over increasing sums of prefix lengths of the first sequence of symbols and the second sequence of symbols, a middle loop that ranges over increasing prefixes of the first sequence of symbols, for each sum of prefix lengths of the outer loop, and an inner loop that ranges over increasing subsequence lengths, for each prefix of the first sequence of symbols of the middle loop. The measure of similarity is output by the computational unit to an information processing application.

BRIEF DESCRIPTION OF DRAWINGS

[0011] These and other aspects of the invention will become apparent from the following description read in conjunction with the accompanying drawings wherein the same reference numerals have been applied to like parts and in which:

[0012] FIG. 1 illustrates a general purpose computer for carrying out the present invention;

[0013] FIG. 2 illustrates one embodiment in which the information processing application and the sequence similarity computation unit shown in FIG. 1 operate together;

[0014] FIG. 3 illustrates contributions from different occurrences of common subsequences in strings s and t;

[0015] FIG. 4 illustrates recursion used for computing the similarity between the sequences of symbols p=GATTACA and q=ACTAGTT;

[0016] FIG. 5 sets forth pseudo code depicting computational operations of a direct method for performing the recursive formulation of the string kernel;

[0017] FIG. 6 illustrates the order in which the direct method shown in FIG. 5 computes values for K″;

[0018] FIG. 7 illustrates data dependencies between K, K′, and K″ in the recursive formulation of the sequence kernel;

[0019] FIG. 8 sets forth pseudo code depicting computational operations of a diagonal method for performing the recursive computation of the string kernel; and

[0020] FIG. 9 illustrates the order in which the diagonal method shown in FIG. 8 computes values for K″.

DETAILED DESCRIPTION OF INVENTION

[0021] A. Operating Environment

[0022] FIG. 1 illustrates a general purpose computer 110 for carrying out the present invention. The general purpose computer 110 includes hardware 112 and software 114. The hardware 112 is made up of a processor (i.e., CPU) 116, memory 118 (ROM, RAM, etc.), persistent storage 120 (e.g., CD-ROM, hard drive, floppy drive, tape drive, etc.), user I/O 122, and network I/O 124. The user I/O 122 can include a keyboard 126, a pointing device 128 (e.g., pointing stick, mouse, etc.), and the display 130. The network I/O 124 may for example be coupled to a network 132 such as the Internet. The software 114 of the general purpose computer 110 includes an operating system 136, a sequence similarity computation unit 138 and an information processing application 140.

[0023] FIG. 2 illustrates one embodiment in which the information processing application 140 and the sequence similarity computation unit 138 operate together. The information processing application 140 identifies textual content stored in sequence data memory 210 for which a measure of similarity is desired. The sequence similarity computation unit 138 receives as input at 212 identifiers of two sequences of symbol data. The sequence similarity computation unit 138 computes, using processor 116 and memory 118, for the information processing application 140 a measure of similarity at 214 of the two sequences of symbols from the memory 210 at 216. The information processing application 140 may then use the measure of similarity 214 for information clustering, classification, cross-lingual information retrieval, routing, text comparison, and/or filtering.

[0024] In an alternate embodiment, the information processing application 140 and the sequence similarity computation unit 138 are embedded together in one or more software modules. In yet another embodiment, the sequence similarity computation unit 138 operates on the general purpose computer 110 that transmits the measure of similarity 214 over the network 132 to another general purpose computer, also coupled to the network 132, on which the information processing application 140 operates.

[0025] B. Mathematical Framework of the String Kernel

[0026] In accordance with the invention, the sequence similarity computation unit 138 computes for given sequence data 216 a measure of similarity 214 using the string kernel. This section sets forth basic notations and definitions for the mathematical framework of the string kernel.

[0027] Let Σ be a finite alphabet, and let s=s₁s₂ . . . s_(|s|) be a sequence of symbols over such alphabet (i.e., s_(i) ∈ Σ, 1≤i≤|s|). Let i=[i₁,i₂, . . . ,i_(n)], with 1≤i₁<i₂< . . . <i_(n)≤|s|, be a subset of the indices in s, where s[i] ∈ Σ^(n) identifies the contiguous or noncontiguous subsequence s_(i₁), s_(i₂), . . . , s_(i_(n)) of symbols. Also, let l(i) be the value i_(n)−i₁+1 (i.e., the length of the window in s spanned by s[i]).
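The index notation above can be illustrated with a minimal Python sketch. The helper names `subsequence` and `window_length` are hypothetical and are not part of the described apparatus; indices are 1-based to follow the notation of this section.

```python
def subsequence(s, idx):
    """Return s[i] for a strictly increasing tuple idx of 1-based indices."""
    # Hypothetical helper: checks the ordering 1 <= i_1 < i_2 < ... <= |s|.
    assert all(a < b for a, b in zip(idx, idx[1:])), "indices must increase"
    assert 1 <= idx[0] and idx[-1] <= len(s), "indices must lie within s"
    return "".join(s[k - 1] for k in idx)


def window_length(idx):
    """Return l(i) = i_n - i_1 + 1, the window spanned by the occurrence."""
    return idx[-1] - idx[0] + 1
```

For instance, the index set [1,3,5] selects the first, third, and fifth symbols of a sequence and spans a window of length five, regardless of how many of the intermediate symbols are skipped.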

[0028] Computing a string kernel amounts to performing an inner product in a feature space of all possible subsequences of length n, with one dimension for each subsequence u ∈ Σ^(n), where the value associated with the feature u is defined by:

$\varphi_{u}(s) = \sum\limits_{i:u = s[i]}\lambda^{l(i)},$

[0029] where λ is a real number between zero and one indicating the decay factor for each gap in subsequence occurrences. The decay factor λ is used to penalize noncontiguous subsequences. For example, if λ is given the value one, noncontiguous subsequences with gaps between matching symbols are taken into account with no penalty when computing the value of the similarity. However, if λ is given the value of 0.5, then gap symbols (i.e., symbols in noncontiguous subsequences that create gaps between matching symbols) contribute to the value of the similarity by dividing the contribution of the match they appear in by two each.

[0030] The string kernel (i.e., similarity K_(n) where n is a fixed positive integer) of two strings s and t over the finite alphabet Σ is defined as:

$\begin{matrix}{K_{n}(s,t) = \sum\limits_{u \in \Sigma^{n}}\varphi_{u}(s) \cdot \varphi_{u}(t) = \sum\limits_{u \in \Sigma^{n}}\sum\limits_{i:u = s[i]}\lambda^{l(i)}\sum\limits_{j:u = t[j]}\lambda^{l(j)} = \sum\limits_{u \in \Sigma^{n}}\sum\limits_{i:u = s[i]}\sum\limits_{j:u = t[j]}\lambda^{l(i) + l(j)}} & \lbrack 1\rbrack\end{matrix}$

[0031] B.1 Example Computation of the String Kernel

[0032] Intuitively, the computation of the string kernel involves the matching of all possible subsequences of “n” symbols, with each occurrence “discounted” according to the size of the window that it spans. Consider for example the alphabet Σ={A,C,G,T}, and the two elementary sequences:

s=CATG and t=ACATT.

[0033] In this example, the similarity between the sequences s and t is measured for all subsequences (or features) of length n=3. The nonzero components of the vectors for s and t in the feature space would then be given as set forth in Table 1.

TABLE 1
u      s      t
AAT    0      λ⁴ + λ⁵
ACA    0      λ³
ACT    0      λ⁴ + λ⁵
ATG    λ³     0
ATT    0      λ³ + λ⁵
CAG    λ⁴     0
CAT    λ³     λ³ + λ⁴
CTG    λ⁴     0
CTT    0      λ⁴

[0034] As shown in Table 1, the only subsequence for which both sequences s and t have a non-null value is CAT. More specifically, the value of the feature u=AAT for sequence t=ACATT is λ⁴+λ⁵ because there are two occurrences of AAT in ACATT. The first occurrence of AAT spans a window of width four (i.e., first, third, and fourth symbols) and the second occurrence of AAT spans a window of width five (i.e., first, third, and fifth symbols). The similarity score is then obtained by multiplying the corresponding components of the feature vectors of the two sequences s and t and then summing the results, as given by:

K₃(CATG,ACATT)=λ³(λ³+λ⁴).
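The example above can be reproduced with a brute-force evaluation of equation [1]. The following Python sketch is illustrative only: the function names `feature_map` and `string_kernel_bruteforce` are hypothetical, and the enumeration of all index sets is exponential in n, so it is suitable only for checking small examples.

```python
from collections import defaultdict
from itertools import combinations


def feature_map(s, n, lam):
    """Map each length-n subsequence u occurring in s to phi_u(s)."""
    phi = defaultdict(float)
    # combinations() yields strictly increasing 0-based index tuples.
    for idx in combinations(range(len(s)), n):
        u = "".join(s[k] for k in idx)
        window = idx[-1] - idx[0] + 1      # l(i) = i_n - i_1 + 1
        phi[u] += lam ** window
    return phi


def string_kernel_bruteforce(s, t, n, lam):
    """Evaluate K_n(s, t) as the dot product of the two feature vectors."""
    phi_s = feature_map(s, n, lam)
    phi_t = feature_map(t, n, lam)
    return sum(v * phi_t[u] for u, v in phi_s.items() if u in phi_t)
```

With λ=0.5, calling `string_kernel_bruteforce("CATG", "ACATT", 3, 0.5)` evaluates λ³(λ³+λ⁴), and `feature_map("ACATT", 3, 0.5)["AAT"]` evaluates λ⁴+λ⁵ as in Table 1.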

[0035] B.2 Computation of the String Kernel Using Dynamic Programming

[0036] A direct computation of all the terms in the nested sum in the equation [1] of the string kernel becomes impractical even for small values of n. There is, however, a recursive formulation that leads to a more efficient dynamic-programming implementation as disclosed by Lodhi et al. in the publications cited above and incorporated herein by reference.

[0037] The recursive formulation is based on the following reasoning. If the value of the string kernel for two strings s and t is known, then the following two observations can be made regarding the computation of the value of the string kernel of sx and t, for some x ∈ Σ (where sx is the sequence obtained by appending the symbol x to the end of the sequence of symbols s):

[0038] (1) all subsequences common to s and t are also common to sx and t; and

[0039] (2) all new matching subsequences ending in x which occur in string t and whose (n−1)-symbol prefix occurs in string s (possibly noncontiguously) must be considered to compute the value of the string kernel of sx and t.

[0040] FIG. 3 illustrates contributions from different occurrences of common subsequences in strings s and t, identified as 302 and 304 respectively. More specifically, FIG. 3 illustrates the observation (2) above in which u′ and u″ are two distinct subsequences of length n−1 that occur in both strings s and t. As occurrences of u′ and u″ in strings s and t can contain gaps, each occurrence of u′ and u″ spans, in principle, a window of different length as illustrated in FIG. 3. Focusing on the occurrence of u″ in s identified at 306, if i^(a) is the set of indices of the occurrence of u″ in s (i.e., if u″=s[i^(a)]), then the length of the window l(i^(a)) spanned by the occurrence of u″ in s identified at 306 is given by l(i^(a))=i_(n−1)^(a)−i₁^(a)+1.

[0041] Together with the occurrence of u″ in s identified at 306, the occurrence of u″ in t identified at 308 gives rise to two new matches of length n for the subsequence u″x between sx and t, due to the occurrence of u″ in t identified at 308 and the occurrences of x in t at 310 and 312. The two new matches (i.e., 306 concatenated to 314 matching 308 concatenated to 310, and 306 concatenated to 314 matching 308 concatenated to 312) contribute to the string kernel (defined at [1]) according to their respective lengths:

$\lambda^{2}\left( \lambda^{|s| - i_{1}^{a} + 1}\lambda^{j_{l1} - j_{1}^{b}} + \lambda^{|s| - i_{1}^{a} + 1}\lambda^{j_{l2} - j_{1}^{b}} \right),$

[0042] where j^(b) is the set of indices of u″ in t identified at 308, and j_(l1) and j_(l2) are the indices of the occurrences of x in t at 310 and 312. Note that the λ² factor is the contribution of the matching x themselves and the rest is the contribution of the occurrences of u″ and of the gaps to the string kernel. Similar contributions will be given by all occurrences of u′ and u″ and all other subsequences of length n−1 in the two strings s and t.

[0043] Thus, the string kernel in equation [1] can be rewritten for sx and t as:

$\begin{matrix}\begin{matrix}{K_{n}(sx,t)} & {= K_{n}(s,t) + \sum\limits_{u \in \Sigma^{n - 1}}\sum\limits_{i:s[i] = u}\sum\limits_{j:t_{j} = x}\sum\limits_{l:t[l] = u,\ l_{n - 1} < j}\lambda^{|s| + 1 - i_{1} + 1}\lambda^{j - l_{1} + 1}} \\ & {= K_{n}(s,t) + \sum\limits_{j:t_{j} = x}\lambda^{2}\sum\limits_{u \in \Sigma^{n - 1}}\sum\limits_{i:s[i] = u}\sum\limits_{l:t[l] = u,\ l_{n - 1} < j}\lambda^{|s| - i_{1} + 1}\lambda^{j - 1 - l_{1} + 1}}\end{matrix} & \lbrack 2\rbrack\end{matrix}$

[0044] It is noted that the part of the second term in the equation [2] within the three innermost sums looks quite similar to the definition of the string kernel for sequences of length n−1, although the contribution decays over |s|−i₁+1 and j−1−l₁+1 rather than i_(n)−i₁+1 and l_(n)−l₁+1 as set forth in the string kernel in equation [1]. Defining:

$K_{n - 1}^{\prime}(s,t) = \sum\limits_{u \in \Sigma^{n - 1}}\sum\limits_{i:u = s[i]}\sum\limits_{j:u = t[j]}\lambda^{|s| - i_{1} + 1}\lambda^{|t| - j_{1} + 1},$

[0045] then equation [2] can be rewritten as:

$K_{n}(sx,t) = K_{n}(s,t) + \sum\limits_{j:t_{j} = x}\lambda^{2}K_{n - 1}^{\prime}(s,t[1:j - 1]),$

[0046] where t[1:j−1] refers to the first j−1 symbols of t. Intuitively, K_(n−1)^(′)(s,t) counts matching subsequences of n−1 symbols, but instead of discounting them according to the length of the window they span as in K_(n−1)(s,t), it discounts them according to the distance from the first symbol in the subsequence to the end of the complete sequence.

[0047] It follows that the values of K_(n)^(′) can be calculated using the following recursive equation:

$\begin{matrix}\begin{matrix}{K_{n}^{\prime}(sx,t)} & {= \sum\limits_{u \in \Sigma^{n}}\sum\limits_{i:u = s[i]}\sum\limits_{j:u = t[j]}\lambda^{(|s| + 1) - i_{1} + 1}\lambda^{|t| - j_{1} + 1}} \\ & {\quad + \sum\limits_{v \in \Sigma^{n - 1}}\sum\limits_{i:v = s[i]}\sum\limits_{j:t_{j} = x}\sum\limits_{l:t[l] = v,\ l_{n - 1} < j}\lambda^{(|s| + 1) - i_{1} + 1}\lambda^{|t| - l_{1} + 1}} \\ & {= \lambda\sum\limits_{u \in \Sigma^{n}}\sum\limits_{i:u = s[i]}\sum\limits_{j:u = t[j]}\lambda^{|s| - i_{1} + 1}\lambda^{|t| - j_{1} + 1}} \\ & {\quad + \sum\limits_{j:t_{j} = x}\sum\limits_{v \in \Sigma^{n - 1}}\sum\limits_{i:v = s[i]}\sum\limits_{l:t[l] = v,\ l_{n - 1} < j}\lambda^{|s| - i_{1} + 1}\lambda^{(j - 1) - l_{1} + 1}\lambda^{|t| - j + 2}} \\ & {= \lambda K_{n}^{\prime}(s,t) + \sum\limits_{j:t_{j} = x}K_{n - 1}^{\prime}(s,t[1:j - 1])\lambda^{|t| - j + 2}}\end{matrix} & \lbrack 3\rbrack\end{matrix}$

[0048] B.3 Examples Using Recursive Formulation

[0049] In a first example, when computing K₂^(′)(s,t) using the alphabet Σ={A,C,G,T} set forth in section B.1 above, the similarity between the sequence s (i.e., s=CATG) and t (t=ACATT) is measured for all subsequences (i.e., features) of length n=2. The computation of K_(i)^(′) is similar to that of K_(i) in the example set forth in section B.1 above, except that in the computation of K_(i)^(′) common subsequences are discounted according to the distance from the first matching symbol of the sequence to the last symbol of the sequence, instead of according to the distance from the first matching symbol of the sequence to the last matching symbol of the sequence. In this first example, the components of the vectors for s and t in the feature space would then be as set forth in Table 2.

TABLE 2
u     s     t
AA    0     λ⁵
AC    0     λ⁵
AG    λ³    0
AT    λ³    2λ³ + 2λ⁵
CA    λ⁴    λ⁴
CC    0     0
CG    λ⁴    0
CT    λ⁴    2λ⁴
GA    0     0
GC    0     0
GG    0     0
GT    0     0
TA    0     0
TC    0     0
TG    λ²    0
TT    0     λ²

[0050] For instance as set forth in Table 2, the value for the feature u=CT for the sequence t=ACATT is 2λ⁴ (i.e., λ⁴+λ⁴) because both occurrences of CT start on the second symbol C which is four symbols away from the end of the sequence t. Hence:

K₂^(′)(CATG,ACATT)=λ³(2λ³+2λ⁵)+λ⁴λ⁴+λ⁴(2λ⁴)=2λ⁶+5λ⁸.
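The value of K₂^(′) in this example can be checked by enumerating all pairs of matching occurrences, following the definition of K^(′) given in section B.2. The Python sketch below is an illustration under that definition; the function name `k_prime_bruteforce` is hypothetical, and the sketch assumes n≥1.

```python
from itertools import combinations


def k_prime_bruteforce(s, t, n, lam):
    """Evaluate K'_n(s, t) for n >= 1 by enumerating matching occurrences.

    Each occurrence is discounted by the distance from its first symbol
    to the end of its sequence, i.e. |s| - i_1 + 1 and |t| - j_1 + 1.
    """
    total = 0.0
    for i in combinations(range(len(s)), n):       # 0-based index tuples
        u = [s[k] for k in i]
        for j in combinations(range(len(t)), n):
            if [t[k] for k in j] == u:
                # With 0-based first indices, |s| - i_1 + 1 == len(s) - i[0].
                total += lam ** (len(s) - i[0]) * lam ** (len(t) - j[0])
    return total
```

With λ=0.5, `k_prime_bruteforce("CATG", "ACATT", 2, 0.5)` evaluates 2λ⁶+5λ⁸.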

[0051] In another example, FIG. 4 illustrates recursion used for calculating the similarity between the sequence of symbols p=GATTACA and q=ACTAGTT. Given the two sequences of symbols p and q, a binary matrix M can be defined such that:

$M_{ij} = \left\{ \begin{matrix}1 & {{if}\quad p_{i} = q_{j}} \\0 & {otherwise}\end{matrix} \right.$
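As a minimal sketch of the definition above, the binary matrix M can be built in Python as follows. The function name `match_matrix` is hypothetical, and the matrix is stored with 0-based indices, so M_(ij) in the notation above corresponds to `M[i-1][j-1]`.

```python
def match_matrix(p, q):
    """Return M with M[i][j] == 1 when p[i] == q[j], and 0 otherwise."""
    return [[1 if p_i == q_j else 0 for q_j in q] for p_i in p]
```

The matrix has one row per symbol of p and one column per symbol of q; each entry equal to one marks a position at which a matching subsequence may be extended.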

[0052] The value K_(n)^(′)(sx,t) (e.g., K₂^(′)(ACTA,GATTA) in FIG. 4) counts the number of matches of sequences of length n (e.g., n=2 in FIG. 4) appropriately discounted from the first element of the match to the end of the two sequences. In the example shown in FIG. 4, two subsequence matches are taken into account by the term K₂^(′)(ACT,GATTA) (i.e., a first is given by <p₁=q₂=A(M₁₂), p₃=q₃=T(M₃₃)>, and a second given by <p₁=q₂=A(M₁₂), p₃=q₄=T(M₃₄)>), appropriately discounted by λ for the additional distance to the end of the sequence sx caused by the final x=A. In addition, three more subsequence matches, appropriately discounted as well, are taken into account by the term K₁^(′)(ACT,GATT) (i.e., a first is given by <p₁=q₂=A(M₁₂), p₄=q₅=A(M₄₅)>, a second is given by <p₃=q₃=T(M₃₃), p₄=q₅=A(M₄₅)>, and a third given by <p₃=q₄=T(M₃₄), p₄=q₅=A(M₄₅)>). It is noted that the contribution of the term K₁^(′)(ACT,G) is null, as none of the symbols in ACT matches the symbol G.

[0053] Intuitively, K_(n−1)^(′)(s,t) is used by the recursive equation [3] to store, as an intermediate result, the total discounted “mass” of matches of length n−1 ready to be turned into matches of length n should the next symbol in s match some of the symbols in t. This “mass” is propagated according to the recursive equation [3]. In terms of the matrix shown in FIG. 4, this means that values of K_(n)^(′) are percolated from left to right along the rows, discounting them by an additional λ for each new step. Moreover, if the position at which the value is computed corresponds itself to a symbol match, then the value is accrued by the masses of sequences of length n−1 stored in the immediate previous column and in the rows from one to one less than the current position.

[0054] B.4 Extended Recursive Formulation

[0055] To summarize, the recursive formulation in section B.2 is given by the following equations:

$\begin{matrix}{K_{n}(sx,t) = K_{n}(s,t) + \sum\limits_{j:t_{j} = x}\lambda^{2}K_{n - 1}^{\prime}(s,t[1:j - 1])} & \lbrack 4\rbrack \\{K_{i}^{\prime}(sx,t) = \lambda K_{i}^{\prime}(s,t) + \sum\limits_{j:t_{j} = x}K_{i - 1}^{\prime}(s,t[1:j - 1])\lambda^{|t| - j + 2},\quad (i = 1,\ldots,n - 1)} & \lbrack 5\rbrack\end{matrix}$

[0056] which has base values defined as:

K₀^(′)(s,t)=1, for all s,t

K_(i)^(′)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1)

K_(n)(s,t)=0, if min(|s|,|t|)<n.
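The recursive equations [4] and [5], together with their base values, can be transcribed almost literally into a memoized recursion over prefix lengths. The following Python sketch is one possible transcription, not the pseudo code of any figure; the function name `string_kernel_recursive` and the use of `functools.lru_cache` are choices made for this illustration, and n≥1 is assumed.

```python
from functools import lru_cache


def string_kernel_recursive(s, t, n, lam):
    """Evaluate K_n(s, t) from equations [4] and [5] on prefixes of s and t."""

    @lru_cache(maxsize=None)
    def k_prime(i, a, b):
        """K'_i(s[:a], t[:b]), equation [5] with its base values."""
        if i == 0:
            return 1.0
        if min(a, b) < i:
            return 0.0
        x = s[a - 1]                          # last symbol of the prefix sx
        total = lam * k_prime(i, a - 1, b)
        for j in range(1, b + 1):             # 1-based positions in t[:b]
            if t[j - 1] == x:
                total += k_prime(i - 1, a - 1, j - 1) * lam ** (b - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(a, b):
        """K_n(s[:a], t[:b]), equation [4] with its base value."""
        if min(a, b) < n:
            return 0.0
        x = s[a - 1]
        total = k(a - 1, b)
        for j in range(1, b + 1):
            if t[j - 1] == x:
                total += lam ** 2 * k_prime(n - 1, a - 1, j - 1)
        return total

    return k(len(s), len(t))
```

Because the recursion depth grows with the sequence lengths, this form is intended only for short sequences such as those of the examples in sections B.1 and B.3.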

[0057] The time required to compute the string kernel using the recursive equations [4] and [5] is given by O(n|s||t|²). This can be seen by observing that the outermost recursion is for increasing values of subsequence lengths (i=1, . . . , n−1), and that for each length and each additional symbol in s and in t, a sum over the whole prefix of t up to the position being considered is required.

[0058] This recursive formulation for computing the string kernel defined by equations [4] and [5], although significantly more efficient than the formulation defined by equation [1], can be further improved by observing that the sum component in the evaluation of the recursive equation [5] (i.e., K_(i)^(′)(sx,t)) can also be computed incrementally. That is, the complexity can be reduced by storing intermediate values of the sum of the recursive equation [5]. Another recursive equation may thus be defined as:

$K_{i}^{''}(sx,t) = \sum\limits_{j:t_{j} = x}K_{i - 1}^{\prime}(s,t[1:j - 1])\lambda^{|t| - j + 2},\quad (i = 1,\ldots,n - 1),$

[0059] which has base values defined as:

K₀^(″)(s,t)=0, for all s,t

K_(i)^(″)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1).

[0060] Intuitively, K_(i)^(″)(sx,t) stores the sum of the discounted masses of matches of subsequences of length i−1 ending somewhere in the column just before the one being considered in the matrix and in some previous row. It follows that:

$\begin{matrix}\begin{matrix}{K_{i}^{''}(sx,ty)} & {= \sum\limits_{j:t_{j} = x}K_{i - 1}^{\prime}(s,t[1:j - 1])\lambda^{(|t| + 1) - j + 2}} & \\ & {= \lambda K_{i}^{''}(sx,t)} & {({if}\quad x \neq y)}\end{matrix} & \lbrack 6\rbrack \\\begin{matrix}{K_{i}^{''}(sx,tx)} & {= \sum\limits_{j:t_{j} = x}K_{i - 1}^{\prime}(s,t[1:j - 1])\lambda^{(|t| + 1) - j + 2} + K_{i - 1}^{\prime}(s,t)\lambda^{(|t| + 1) - (|t| + 1) + 2}} & \\ & {= \lambda K_{i}^{''}(sx,t) + \lambda^{2}K_{i - 1}^{\prime}(s,t)} & {({otherwise})}\end{matrix} & \lbrack 7\rbrack\end{matrix}$

[0061] The recursive equations [6] and [7] can be rewritten together as:

$\begin{matrix}{K_{i}^{''}(sx,ty) = \left\{ \begin{matrix}{\lambda\left( K_{i}^{''}(sx,t) + \lambda K_{i - 1}^{\prime}(s,t) \right)} & {{if}\quad x = y} \\{\lambda K_{i}^{''}(sx,t)} & {otherwise}\end{matrix} \right.} & \lbrack 8\rbrack\end{matrix}$

[0062] The recursive equation for K_(i)^(′)(sx,t) can thus be expressed as a function of K″ as:

K_(i)^(′)(sx,t)=λK_(i)^(′)(s,t)+K_(i)^(″)(sx,t) (i=1, . . . ,n−1)  [9]

[0063] By introducing K″, a single sum for updating K″ is sufficient for each new element of s and t, instead of a sum over all values for j:t_(j)=x. The overall time complexity for computing the string kernel using these equations thus reduces to O(n|s||t|).
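The extended recursive formulation can be sketched as an iterative computation in which K″ is carried along each row as a single running value, per equation [8], and K′ is filled in from it, per equation [9]. The Python sketch below is an illustration written for this section and is not the pseudo code of any figure; the name `string_kernel_dp` is hypothetical, n≥1 is assumed, and a full table of K′ values is kept for clarity rather than for memory efficiency.

```python
def string_kernel_dp(s, t, n, lam):
    """Evaluate K_n(s, t) in O(n|s||t|) time using equations [4], [8], [9]."""
    len_s, len_t = len(s), len(t)
    if min(len_s, len_t) < n:
        return 0.0
    # k_prime[i][a][b] holds K'_i(s[:a], t[:b]); layer 0 is all ones and
    # the remaining layers start at zero, matching the base values.
    k_prime = [[[1.0 if i == 0 else 0.0] * (len_t + 1)
                for _ in range(len_s + 1)] for i in range(n)]
    for i in range(1, n):
        for a in range(i, len_s + 1):
            k_second = 0.0                 # K''_i(s[:a], t[:b]) for b < i
            for b in range(i, len_t + 1):
                if s[a - 1] == t[b - 1]:   # equation [8], case x = y
                    k_second = lam * (k_second + lam * k_prime[i - 1][a - 1][b - 1])
                else:                      # equation [8], otherwise
                    k_second = lam * k_second
                # equation [9]
                k_prime[i][a][b] = lam * k_prime[i][a - 1][b] + k_second
    # Equation [4], unrolled over every prefix of s and every match in t.
    total = 0.0
    for a in range(n, len_s + 1):
        for b in range(n, len_t + 1):
            if s[a - 1] == t[b - 1]:
                total += lam ** 2 * k_prime[n - 1][a - 1][b - 1]
    return total
```

The innermost loop performs a constant amount of work per triple (i, a, b), which is where the O(n|s||t|) bound comes from. For s=CATG, t=ACATT, and n=3, the result agrees with the value λ³(λ³+λ⁴) obtained in section B.1.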

[0064] B.5 Summary Of Extended Recursive Formulation

[0065] The measure of similarity K_(n) that matches subsequences of symbols of length n between a first sequence of symbols and a second sequence of symbols, where the measure of similarity K_(n) is defined by a first recursive equation (set forth above as equation [4]), is given by:

$K_{n}(sx,t) = K_{n}(s,t) + \sum\limits_{j:t_{j} = x}\lambda^{2}K_{n - 1}^{\prime}(s,t[1:j - 1]),\quad {where}$

[0066] n: is a subsequence length,

[0067] K_(n): is the measure of similarity for subsequences of length n,

[0068] s: is any sequence of symbols,

[0069] t: is any other sequence of symbols,

[0070] sx: is the sequence of symbols obtained by appending the symbol x to the end of the sequence of symbols s,

[0071] λ: is a decay factor for penalizing matching subsequences that are noncontiguous (0≤λ≤1),

[0072] t[1:j−1]: refers to the first j−1 symbols of t,

[0073] $\sum\limits_{j:t_{j} = x}$: is a sum that ranges over all indices j such that t_(j) (i.e., the j^(th) symbol of t) is x,

[0074] K′: is a function defined by a second recursive equation (set forth above as equation [9]) given by:

K_(i)^(′)(sx,t)=λK_(i)^(′)(s,t)+K_(i)^(″)(sx,t) (i=1, . . . ,n−1), and

[0075] K″: is a function defined by a third recursive equation (set forth above as equation [8]) given by:

$K_{i}^{''}(sx,ty) = \left\{ \begin{matrix}{\lambda\left( K_{i}^{''}(sx,t) + \lambda K_{i - 1}^{\prime}(s,t) \right)} & {{if}\quad x = y} \\{\lambda K_{i}^{''}(sx,t)} & {otherwise}\end{matrix} \right.$

[0076] The base values for the recursive equations [4], [9], and [8] are defined as:

K₀^(″)(s,t)=0, for all s,t

K_(i)^(″)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1)

K₀^(′)(s,t)=1, for all s,t,

K_(i)^(′)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1)

K_(n)(s,t)=0, if min(|s|,|t|)<n,

[0077] where |s| is the cardinality of s, and |t| is the cardinality of t.

[0078] C. Direct Computation of the Recursive Formulation

[0079] In one embodiment, the recursive computation of the string kerneldefined by the recursive equations [4], [9], and [8] set forth insection B.5 is implemented directly (herein referred to as “the directmethod”). FIG. 5 sets forth pseudo code depicting computationaloperations (e.g., carried out by processor 116 in FIG. 1) of the directmethod for performing the recursive computation (i.e., Direct(s,t,λ,N)returns K_(N)(s,t)).

[0080] The direct method represents intermediate results usingthree-dimensional matrices, and computes the matrices for K′ and K″layer after layer following a path parallel to the edges within eachlayer. This computational order requires that a full layer for eachstructure be stored in memory at any given time, thus leading to memoryrequirements proportional to the products of the lengths of thesequences. being evaluated.

[0081] More specifically, the direct method requires N arrays, eachhaving the dimension |s|×|t|, for storing the intermediate results of K′and K″ (e.g., in memory 118 while the processor 116 carries outinstructions corresponding to the direct method in FIG. 5). Despite theadopted indexing convention, N arrays of dimensions |s| are sufficientfor storing the intermediate results of K. The space complexity of thedirect method is therefore O(N|s∥t|).

[0082] With specific reference now to K′ and K″, FIG. 6 illustrates theorder in which the direct method shown in FIG. 5 computes values for K″(and K′ similarly although not shown in the FIG. 6) when the length ofthe string s and t are both equal to eight and the subsequence length isn=3. More specifically, as depicted in FIG. 6 for K₁ ^(″), K₂ ^(″), andK₃ ^(″), three nested loops (i.e., an outer loop, a middle loop, and aninner loop) index over K′ and K″ as follows:

[0083] (a) the outer loop performs computations for increasing subsequence length n, and

[0084] (b) for each n, the middle loop performs computations for increasing prefixes of s, and

[0085] (c) for each prefix of s, the inner loop performs computations for increasing prefixes of t.
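The three nested loops listed above can be sketched in Python as follows. This is a minimal illustration and not the pseudo code of FIG. 5: the function name `direct_kernel`, the list layout `Kp[n][i][j]` and `Kpp[n][i][j]` (indexed by prefix lengths of s and t), and the choice to start each layer at i = j = n (relying on the zero base values for shorter prefixes) are assumptions made for the example. The update rules follow the recursive equations for K, K′, and K″ as recited in the claims.

```python
def direct_kernel(s, t, lam, N):
    """Sketch of the direct order: layer n, then prefixes of s, then prefixes of t."""
    ls, lt = len(s), len(t)
    # Kp[n][i][j] holds K'_n(s[:i], t[:j]); Kpp[n][i][j] holds K''_n(s[:i], t[:j]).
    # Full layers are kept, so storage grows as N * |s| * |t|.
    Kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(N)]
    Kpp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(N)]
    for i in range(ls + 1):
        for j in range(lt + 1):
            Kp[0][i][j] = 1.0  # base value: K'_0 = 1 for all prefixes

    for n in range(1, N):                  # outer loop: subsequence length
        for i in range(n, ls + 1):         # middle loop: prefixes of s
            for j in range(n, lt + 1):     # inner loop: prefixes of t
                # K''_n(sx, ty) = lam * K''_n(sx, t), plus lam^2 * K'_{n-1}(s, t) if x == y
                Kpp[n][i][j] = lam * Kpp[n][i][j - 1]
                if s[i - 1] == t[j - 1]:
                    Kpp[n][i][j] += lam * lam * Kp[n - 1][i - 1][j - 1]
                # K'_n(sx, t) = lam * K'_n(s, t) + K''_n(sx, t)
                Kp[n][i][j] = lam * Kp[n][i - 1][j] + Kpp[n][i][j]

    # K_N(s, t): sum of lam^2 * K'_{N-1} over all matching symbol pairs
    k = 0.0
    for i in range(1, ls + 1):
        for j in range(1, lt + 1):
            if s[i - 1] == t[j - 1]:
                k += lam * lam * Kp[N - 1][i - 1][j - 1]
    return k
```

Under these assumptions, `direct_kernel("cat", "car", 0.5, 2)` evaluates to λ⁴ = 0.0625, because "ca" is the only common subsequence of length two and it occurs contiguously in both strings.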

[0086] The order in which the values for K′ and K″ are computed can be defined by an integer function “ord” such that ord(n′,i′,j′) is less than ord(n″,i″,j″) if and only if, whenever K_(n′)^(′)(i′,j′) and K_(n″)^(′)(i″,j″) are both computed, K_(n′)^(′)(i′,j′) is computed before K_(n″)^(′)(i″,j″) (and similarly for K″). In the case of the direct computation, this integer function is given by (as shown for the example in FIG. 6):

ord_(direct)(n,i,j)=(n−1)(max(|s|,|t|))²+(i−n)max(|s|,|t|)+j−n+1,

[0087] where (i,j,n) defines a position in the matrix K_(n)^(′) or K_(n)^(″), as shown in FIG. 6 for K_(n)^(″).
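The ordering function can be checked against the loop order with a short Python sketch. It assumes the formula is read with the square applied to max(|s|,|t|), and that layer n visits only positions with i ≥ n and j ≥ n; the names `ord_direct` and `direct_visit_order` are hypothetical helpers introduced for this illustration.

```python
def ord_direct(n, i, j, len_s, len_t):
    """Rank of position (i, j) in layer n under the direct order (assumed reading)."""
    m = max(len_s, len_t)
    return (n - 1) * m * m + (i - n) * m + (j - n + 1)


def direct_visit_order(len_s, len_t, N):
    """Positions in the order of the three nested loops: n, then i, then j."""
    return [(n, i, j)
            for n in range(1, N + 1)
            for i in range(n, len_s + 1)
            for j in range(n, len_t + 1)]


# Example matching the setting of FIG. 6: |s| = |t| = 8, layers 1 to 3.
ranks = [ord_direct(n, i, j, 8, 8) for (n, i, j) in direct_visit_order(8, 8, 3)]
```

For this example the resulting ranks are strictly increasing, which is the property the text requires of an ordering function.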

[0088] D. Diagonal Computation of the Recursive Formulation

[0089] FIG. 7 illustrates the data dependencies between K, K′, and K″ in the recursive formulation of the sequence kernel set forth in section B.5. Specifically, the data dependencies in FIG. 7 can be obtained by observing that K_(n)^(″)(i,j) is needed only when computing K_(n)^(″)(i,j+1) and K_(n)^(′)(i,j), as indicated by reference numbers 702 and 704 respectively. Once these computations have completed, there is no longer any need to store K_(n)^(″)(i,j) in memory. In addition, FIG. 7 shows that K_(n)^(′)(i,j) is needed only when computing K_(n)^(′)(i+1,j), K_(n+1)^(″)(i+1,j+1), and K_(n+1)(i+1,|t|), as indicated by reference numbers 706, 708, and 710 respectively.

[0090] In another embodiment, the recursive computation of the string kernel defined by the recursive equations set forth in section B.5 is implemented using a more space efficient computational method (herein referred to as “the diagonal method”). FIG. 8 sets forth pseudo code depicting computational operations (e.g., carried out by processor 116 in FIG. 1) of the diagonal method for performing the recursive computation (i.e., Diagonal(s,t,λ,N) returns K_(N)(s,t)).

[0091] While the direct method set forth in FIG. 5 and the diagonal method set forth in FIG. 8 yield the same output (i.e., K_(N)(|s|,|t|)), the diagonal method computes values of the recursive equation [9] for K′ and the recursive equation [8] for K″ in a different order.

[0092] With specific reference now to K′ and K″, FIG. 9 illustrates the order in which the diagonal method shown in FIG. 8 computes values for K″ (and K′ similarly, although not shown in FIG. 9) when the lengths of the strings s and t are both equal to eight and the subsequence length is n=3. The computational order of the diagonal method (herein referred to as “the diagonal order”) can be visualized intuitively as sweeping a diagonal across each of the matrices K′ and K″ shown in FIG. 9 and computing the values for all layers of K₁^(″), K₂^(″), and K₃^(″) at any given position before moving to the next position (e.g., see order K₁^(″)(1,1), K₂^(″)(2,2), and K₃^(″)(3,3), etc.).

[0093] In the direct method every value in K′ and K″ for a given subsequence length n is required for computing at least one value in K′ and K″ for subsequences of length n+1, and all values for length n are computed before any value for length n+1. Consequently, there must be sufficient memory for storing O(n|s||t|) values for each of K′ and K″ as set forth in the inner loop of the direct method shown in FIG. 5.

[0094] Advantageously, when the diagonal order is followed there is no longer a need for storing in memory O(n|s||t|) values for each of K′ and K″, as values can be “forgotten” as soon as they are no longer needed for any other recursive computation. More specifically, an array in memory of size n×|s| is sufficient for computing values for K″ and an array in memory of size n×|t| is sufficient for computing values for K′. This can be seen in FIG. 8 for the partial values stored in memory for K_(n)^(″)(i) at 814 and 816 and for K_(n)^(′)(j) at 818.
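The difference in storage can be made concrete with a small Python sketch that counts array cells under the sizes stated in the text. The helper names `direct_cells` and `diagonal_cells` are hypothetical, N stands for the largest subsequence length, and the counts ignore padding rows and any implementation overhead, so they are only indicative.

```python
def direct_cells(len_s, len_t, N):
    """Cells for the direct method: N layers of |s| x |t| for each of K' and K'',
    plus N arrays of length |s| for K."""
    return 2 * N * len_s * len_t + N * len_s


def diagonal_cells(len_s, len_t, N):
    """Cells for the diagonal method: N x |s| for K'', N x |t| for K',
    plus N x |s| for the incrementally computed K."""
    return N * len_s + N * len_t + N * len_s
```

For two sequences of 1,000 symbols each and N = 5, these counts come to 10,005,000 cells for the direct method against 15,000 cells for the diagonal method.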

[0095] The diagonal method proceeds as shown in FIG. 8 by initializing at 802 base values for K, K′ and K″. (The base values for K″ in the example shown in FIG. 9 initialized at 802 are identified with hash marks.)

[0096] At 804 in FIG. 8, three nested loops (i.e., an outer loop 806, a middle loop 808, and an inner loop 810) index over K′ and K″ as follows:

[0097] (a) the outer loop performs computations for increasing sums of prefix lengths of s and t (identified in FIG. 9 as indices 902), and

[0098] (b) for each sum of prefix lengths, the middle loop performs computations for increasing prefixes of s, and

[0099] (c) for each prefix of s, the inner loop performs computations for increasing subsequence lengths n.

[0100] As set forth in the recursive equation [4], each intermediate value K_(n)^(′)(i,j) is also needed for computing K_(n+1)(i+1,|t|). Following the diagonal computational order, it would be impossible to compute the values for K_(n+1)(i+1,j) (0≦j≦|t|) after having computed all the values of K′, as many relevant values of K′ would no longer be available. The structure of the recursive equation [4] suggests, however, that the values for K_(n)(i,|t|) can be computed incrementally. Generally, an array in memory of size O(n×|s|) is sufficient for storing intermediate values for K that are computed incrementally.

[0101] Subsequently at 820, whenever indices i and j such that s_(i)=t_(j) are met during the computation, K_(n+1)(i+1) is incremented by K_(n)^(′)(i). Thus, at the end of the outer loop in the recursive computation, each K_(n)(i,|t|) contains the value given by:

$K_{n}(i,|t|) = \sum_{j:t_{j}=s_{i}} K_{n-1}^{\prime}(i-1,\, t[1:j-1]).$

[0102] Finally at 812, the K_(n)(i,|t|) values are considered in turn for increasing values of i, and each one is multiplied by λ² and then incremented by the value of K_(n)(i−1,|t|) computed immediately before. This ensures that eventually the value stored in each K_(n)(i,|t|) is actually the one required by the recursive formulation.
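The diagonal order, the incremental accumulation of K, and the final pass described above can be sketched in Python as follows. This is an illustrative reading of the description and not the pseudo code of FIG. 8: the function name `diagonal_kernel`, the arrays `kp`, `kpp`, and `k`, and the use of offsets (a, b) so that layer n at a given position addresses cell (a+n, b+n) are assumptions inferred from the order K₁^(″)(1,1), K₂^(″)(2,2), K₃^(″)(3,3) given in the text.

```python
def diagonal_kernel(s, t, lam, N):
    """Sketch of the diagonal order using O(N * (|s| + |t|)) storage."""
    ls, lt = len(s), len(t)
    # kp[n][j]: most recent K'_n(i, j) computed in column j (row i is implicit).
    kp = [[0.0] * (lt + 1) for _ in range(N)]
    # kpp[n][i]: most recent K''_n(i, j) computed in row i (column j is implicit).
    kpp = [[0.0] * (ls + 1) for _ in range(N)]
    # k[n][i]: incrementally accumulated contributions to K_n(i, |t|).
    k = [[0.0] * (ls + 1) for _ in range(N + 1)]

    for d in range(ls + lt - 1):                            # outer: sum of offsets
        for a in range(max(0, d - (lt - 1)), min(d, ls - 1) + 1):  # middle: offset in s
            b = d - a
            prev = 1.0                                      # K'_0 = 1 at every position
            for n in range(1, N + 1):                       # inner: subsequence length
                i, j = a + n, b + n
                if i > ls or j > lt:
                    break
                match = s[i - 1] == t[j - 1]
                if match:
                    k[n][i] += prev                         # adds K'_{n-1}(i-1, j-1)
                if n == N:
                    break
                # kpp[n][i] still holds K''_n(i, j-1); overwrite it with K''_n(i, j)
                kpp[n][i] = lam * kpp[n][i] + (lam * lam * prev if match else 0.0)
                # kp[n][j] still holds K'_n(i-1, j); overwrite it with K'_n(i, j)
                kp[n][j] = lam * kp[n][j] + kpp[n][i]
                prev = kp[n][j]

    # Final pass: scale by lam^2 and add the value for the preceding prefix of s.
    for n in range(1, N + 1):
        for i in range(1, ls + 1):
            k[n][i] = lam * lam * k[n][i] + k[n][i - 1]
    return k[N][ls]
```

Each overwritten entry is one that no later step needs, which is why a single row per layer suffices for K″ and a single column per layer suffices for K′. Under the stated assumptions, `diagonal_kernel("cat", "car", 0.5, 2)` evaluates to λ⁴ = 0.0625.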

[0103] Similar to the integer function ord_(direct) given above for the direct computation, the integer function that provides a total ordering of the computation of values for K′ and K″ for the diagonal computation is given by (as shown for the example in FIG. 9):

$\mathrm{ord}_{diagonal}(n,i,j) = N\frac{(i+j-2n)(i+j-2n+1)}{2} + N(i-n) + n,$

[0104] where (i,j,n) defines a position in the matrix K_(n)^(′) or K_(n)^(″), and N is the largest subsequence length computed, as shown in FIG. 9 for K_(n)^(″).
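A short Python sketch can be used to check that this function agrees with the diagonal loop order. It assumes that the first factor of the fraction is (i+j−2n), so that the fraction is a triangular number counting the positions on earlier diagonals, and that layer n at offsets (a, b) addresses cell (a+n, b+n); `ord_diagonal` and `diagonal_visit_order` are hypothetical names introduced for the illustration.

```python
def ord_diagonal(n, i, j, N):
    """Rank of position (i, j) in layer n under the diagonal order (assumed reading)."""
    d = i + j - 2 * n                      # index of the diagonal, starting at 0
    return N * d * (d + 1) // 2 + N * (i - n) + n


def diagonal_visit_order(len_s, len_t, N):
    """Positions in the order: diagonal, then offset in s, then subsequence length."""
    order = []
    for d in range(len_s + len_t - 1):
        for a in range(d + 1):
            b = d - a
            for n in range(1, N + 1):
                i, j = a + n, b + n
                if i <= len_s and j <= len_t:
                    order.append((n, i, j))
    return order


# Example matching the setting of FIG. 9: |s| = |t| = 8, layers 1 to 3.
visits = diagonal_visit_order(8, 8, 3)
ranks = [ord_diagonal(n, i, j, 3) for (n, i, j) in visits]
```

In this example the first three positions visited are (1,1,1), (2,2,2), and (3,3,3) in (n,i,j) form, with ranks 1, 2, and 3, and the ranks increase strictly along the whole sweep.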

[0105] To recapitulate with reference again to FIGS. 1, 2 and 8, the sequence similarity computation unit 138 computes a measure of similarity 214 between a first sequence of symbols and a second sequence of symbols received at 216. Memory 118 is allocated for the computational unit 138 for storing values that are computed using a recursive formulation, with memory of size O((|s|+|t|)×n) allocated for storing intermediate values of K, K′, and K″. In addition, memory 118 is allocated for storing processing instructions of the computational unit 138 (e.g., with functionality similar to that which is shown in FIG. 8) for carrying out the recursive formulation that computes the measure of similarity based on matching subsequences of symbols of length n between the first sequence of symbols having a length |s| and the second sequence of symbols having a length |t|. The processor 116 coupled to the memory 118 executes the processing instructions of computational unit 138.

[0106] The processor 116 in executing the processing instructions computes the values for the measure of similarity using the recursive formulation within which functions are computed using nested loops (as defined above in section B.5). An outer loop ranges over increasing sums of prefix lengths of the first sequence of symbols and the second sequence of symbols (e.g., as shown at 806 in FIG. 8). A middle loop ranges over increasing prefixes of the first sequence of symbols, for each sum of prefix lengths of the outer loop (e.g., as shown at 808 in FIG. 8). An inner loop ranges over increasing subsequence lengths, for each prefix of the first sequence of symbols of the middle loop (e.g., as shown at 810 in FIG. 8). The computed measure of similarity 214 is then output by the computation unit 138 to the information processing application 140.

[0107] E. Miscellaneous

[0108] The terms “string” and “sequence of symbols” are used interchangeably herein to specify a concatenation of symbols (or symbol data). The symbols in a sequence of symbols may encode any set of terms including but not limited to: alphanumeric characters (e.g., alphabetic letters, numbers), symbols, words, lemmas, music notes or scores, biological or chemical formulations (e.g., amino acids or DNA bases), and kanji characters.

[0109] Using the foregoing specification, the invention may be implemented as a machine (or system), process (or method), or article of manufacture by using standard programming and/or engineering techniques to produce programming software, firmware, hardware, or any combination thereof.

[0110] Any resulting program(s), having computer-readable program code, may be embodied within one or more computer-usable media such as memory devices or transmitting devices, thereby making a computer program product or article of manufacture according to the invention. As such, the terms “article of manufacture” and “computer program product” as used herein are intended to encompass a computer program existent (permanently, temporarily, or transitorily) on any computer-usable medium such as on any memory device or in any transmitting device.

[0111] Executing program code directly from one medium, storing program code onto a medium, copying the code from one medium to another medium, transmitting the code using a transmitting device, or other equivalent acts may involve the use of a memory or transmitting device which only embodies program code transitorily as a preliminary or final step in making, using, or selling the invention.

[0112] Memory devices include, but are not limited to, fixed (hard) disk drives, floppy disks (or diskettes), optical disks, magnetic tape, semiconductor memories such as RAM, ROM, PROMs, etc. Transmitting devices include, but are not limited to, the Internet, intranets, electronic bulletin board and message/note exchanges, telephone/modem based network communication, hard-wired/cabled communication network, cellular communication, radio wave communication, satellite communication, and other stationary or mobile network systems/communication links.

[0113] A machine embodying the invention may involve one or more processing systems including, but not limited to, CPU, memory/storage devices, communication links, communication/transmitting devices, servers, I/O devices, or any subcomponents or individual parts of one or more processing systems, including software, firmware, hardware, or any combination or subcombination thereof, which embody the invention as set forth in the claims.

[0114] It will be appreciated that various other alternatives, modifications, variations, improvements or other such equivalents of the teachings herein that may be presently unforeseen, unappreciated or subsequently made by others are also intended to be encompassed by the claims.

What is claimed is:
1. A dynamic programming method for computing a measure of similarity between a first sequence of symbols and a second sequence of symbols, comprising: allocating memory for a computational unit for storing values that are computed using a recursive formulation that computes the measure of similarity based on matching subsequences of symbols between the first sequence of symbols and the second sequence of symbols; computing with a processor for the computational unit the values for the measure of similarity using the recursive formulation within which functions are computed using nested loops that include: an outer loop that ranges over increasing sums of prefix lengths of the first sequence of symbols and the second sequence of symbols, a middle loop that ranges over increasing prefixes of the first sequence of symbols, for each sum of prefix lengths of the outer loop, and an inner loop that ranges over increasing subsequence lengths, for each prefix of the first sequence of symbols of the middle loop; and outputting the measure of similarity.
2. The method according to claim 1, wherein the measure of similarity is output to an information processing application.
3. The method according to claim 2, wherein the information processing application uses the measure of similarity for performing one of information clustering, classification, cross-lingual information retrieval, routing, text comparison and filtering.
4. The method according to claim 2, wherein a first function of the recursive formulation is given by:

$K_{n}(sx,t) = K_{n}(s,t) + \sum_{j:t_{j}=x} \lambda^{2} K_{n-1}^{\prime}(s,\, t[1:j-1]),$

where n: is a subsequence length, K_(n): is the measure of similarity for subsequences of length n, s: is any sequence of symbols, t: is any other sequence of symbols, sx: is the sequence of symbols obtained by appending the symbol x to the end of the sequence of symbols s, λ: is a decay factor for penalizing matching subsequences that are noncontiguous; t[1:j−1]: refers to the first j−1 symbols of t, $\sum_{j:t_{j}=x}$: is a sum that ranges over all indices j such that t_(j) is x, K′: is a second function of the recursive formulation that is given by: K_(i)^(′)(sx,t)=λK_(i)^(′)(s,t)+K_(i)^(″)(sx,t) (i=1, . . . ,n−1), and K″: is a third function of the recursive formulation that is given by:

$K_{i}^{\prime\prime}(sx,ty) = \begin{cases} \lambda\left(K_{i}^{\prime\prime}(sx,t) + \lambda K_{i-1}^{\prime}(s,t)\right) & \text{if } x = y \\ \lambda K_{i}^{\prime\prime}(sx,t) & \text{otherwise} \end{cases}.$
5. The method according to claim 4, further comprising defining base values for the recursive formulation to be: K₀^(″)(s,t)=0, for all s,t; K_(i)^(″)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1); K₀^(′)(s,t)=1, for all s,t; K_(i)^(′)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1); K_(n)(s,t)=0, if min(|s|,|t|)<n, where |s|: is the cardinality of s, and |t|: is the cardinality of t.
6. The method according to claim 5, wherein the memory allocated for storing intermediate values of K′ and K″ is sufficient for storing up to (n×|s|+n×|t|) values in the memory.
7. The method according to claim 2, wherein ones of the matching subsequences of symbols are noncontiguous matches of symbols between the first sequence of symbols and the second sequence of symbols.
8. The method according to claim 7, wherein the noncontiguous occurrences of matching subsequences of symbols are penalized according to how many gaps exist between matching symbols.
9. The method according to claim 2, wherein the measure of similarity is computed for all possible subsequences of symbols between the first sequence of symbols and the second sequence of symbols.
10. An apparatus for computing a measure of similarity between a first sequence of symbols and a second sequence of symbols, comprising: a memory for storing: (a) values that are computed using a recursive formulation that computes the measure of similarity based on matching subsequences of symbols between the first sequence of symbols and the second sequence of symbols, and (b) processing instructions of a computational unit for carrying out the recursive formulation; and a processor coupled to the memory for executing the processing instructions of the computational unit; the processor in executing the processing instructions computing the values for the measure of similarity using the recursive formulation within which functions are computed using nested loops that include: an outer loop that ranges over increasing sums of prefix lengths of the first sequence of symbols and the second sequence of symbols, a middle loop that ranges over increasing prefixes of the first sequence of symbols, for each sum of prefix lengths of the outer loop, and an inner loop that ranges over increasing subsequence lengths, for each prefix of the first sequence of symbols of the middle loop.
11. The apparatus according to claim 10, further comprising an information processing application for receiving the measure of similarity from the computational unit.
12. The apparatus according to claim 11, wherein the information processing application uses the measure of similarity for performing one of information clustering, classification, cross-lingual information retrieval, routing, text comparison and filtering.
13. The apparatus according to claim 11, wherein a first function of the recursive formulation is given by:

$K_{n}(sx,t) = K_{n}(s,t) + \sum_{j:t_{j}=x} \lambda^{2} K_{n-1}^{\prime}(s,\, t[1:j-1]),$

where n: is a subsequence length, K_(n): is the measure of similarity for subsequences of length n, s: is any sequence of symbols, t: is any other sequence of symbols, sx: is the sequence of symbols obtained by appending the symbol x to the end of the sequence of symbols s, λ: is a decay factor for penalizing matching subsequences that are noncontiguous; t[1:j−1]: refers to the first j−1 symbols of t, $\sum_{j:t_{j}=x}$: is a sum that ranges over all indices j such that t_(j) is x, K′: is a second function of the recursive formulation that is given by: K_(i)^(′)(sx,t)=λK_(i)^(′)(s,t)+K_(i)^(″)(sx,t) (i=1, . . . ,n−1), and K″: is a third function of the recursive formulation that is given by:

$K_{i}^{\prime\prime}(sx,ty) = \begin{cases} \lambda\left(K_{i}^{\prime\prime}(sx,t) + \lambda K_{i-1}^{\prime}(s,t)\right) & \text{if } x = y \\ \lambda K_{i}^{\prime\prime}(sx,t) & \text{otherwise} \end{cases}.$
14. The apparatus according to claim 13, wherein base values for the recursive formulation are defined to be: K₀^(″)(s,t)=0, for all s,t; K_(i)^(″)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1); K₀^(′)(s,t)=1, for all s,t; K_(i)^(′)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1); K_(n)(s,t)=0, if min(|s|,|t|)<n, where |s|: is the cardinality of s, and |t|: is the cardinality of t.
14. The method according to claim 15, wherein the memory allocated for storing intermediate values of K′ and K″ is sufficient for storing up to (n×|s|+n×|t|) values in the memory.
15. The apparatus according to claim 11, wherein ones of the matching subsequences of symbols are noncontiguous matches of symbols between the first sequence of symbols and the second sequence of symbols, and the noncontiguous occurrences of matching subsequences of symbols are penalized according to how many gaps exist between matching symbols.
16. The apparatus according to claim 11, wherein the measure of similarity is computed for all possible subsequences of symbols between the first sequence of symbols and the second sequence of symbols.
17. An article of manufacture for use in a machine comprising: a) a memory; b) instructions stored in the memory for computing a measure of similarity between a first sequence of symbols and a second sequence of symbols comprising: allocating memory for a computational unit for storing values that are computed using a recursive formulation that computes the measure of similarity based on matching subsequences of symbols between the first sequence of symbols and the second sequence of symbols; computing for the computational unit with a processor the values for the measure of similarity using the recursive formulation within which functions are computed using nested loops that include: an outer loop that ranges over increasing sums of prefix lengths of the first sequence of symbols and the second sequence of symbols, a middle loop that ranges over increasing prefixes of the first sequence of symbols, for each sum of prefix lengths of the outer loop, and an inner loop that ranges over increasing subsequence lengths, for each prefix of the first sequence of symbols of the middle loop; and outputting the measure of similarity.
18. The article of manufacture according to claim 17, wherein the measure of similarity is output to an information processing application that uses the measure of similarity for performing one of information clustering, classification, cross-lingual information retrieval, routing, text comparison and filtering.
19. The article of manufacture according to claim 18, wherein a first function of the recursive formulation is given by:

$K_{n}(sx,t) = K_{n}(s,t) + \sum_{j:t_{j}=x} \lambda^{2} K_{n-1}^{\prime}(s,\, t[1:j-1]),$

where n: is a subsequence length, K_(n): is the measure of similarity for subsequences of length n, s: is any sequence of symbols, t: is any other sequence of symbols, sx: is the sequence of symbols obtained by appending the symbol x to the end of the sequence of symbols s, λ: is a decay factor for penalizing matching subsequences that are noncontiguous; t[1:j−1]: refers to the first j−1 symbols of t, $\sum_{j:t_{j}=x}$: is a sum that ranges over all indices j such that t_(j) is x, K′: is a second function of the recursive formulation that is given by: K_(i)^(′)(sx,t)=λK_(i)^(′)(s,t)+K_(i)^(″)(sx,t) (i=1, . . . ,n−1), and K″: is a third function of the recursive formulation that is given by:

$K_{i}^{\prime\prime}(sx,ty) = \begin{cases} \lambda\left(K_{i}^{\prime\prime}(sx,t) + \lambda K_{i-1}^{\prime}(s,t)\right) & \text{if } x = y \\ \lambda K_{i}^{\prime\prime}(sx,t) & \text{otherwise} \end{cases};$ and

wherein base values for the recursive formulation are defined to be: K₀^(″)(s,t)=0, for all s,t; K_(i)^(″)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1); K₀^(′)(s,t)=1, for all s,t; K_(i)^(′)(s,t)=0, if min(|s|,|t|)<i, (i=1, . . . ,n−1); K_(n)(s,t)=0, if min(|s|,|t|)<n, where |s|: is the cardinality of s, and |t|: is the cardinality of t.
20. The method according to claim 19, wherein the memory allocated for storing intermediate values of K′ and K″ is sufficient for storing up to (n×|s|+n×|t|) values in the memory.